Collecting and Processing WhatsApp Data Donations

Julian Kohne

2025-03-24

Overview

Goal:
Semi-interactive walk-through of the process for preprocessing, collecting, and analyzing donated WhatsApp chat log data.

Time Block
09:00 - 09:15 Presentation: Overview of WhatsApp Chat Log Data
09:15 - 09:45 Code along: Exporting & Parsing WhatsApp Chat Log Data
09:45 - 09:55 Presentation: Anonymization & Consent Checking
09:55 - 10:10 Code along: Anonymization & Consent Checking
10:10 - 10:20 Presentation: ChatDashboard for Data Donation Studies
10:20 - 10:45 Code along: Installing and adapting ChatDashboard
10:45 - 11:00 Discussion, Q&A

Code-along


What you need to code-along:

Overview:
WhatsApp Chat Log Data

Why Chat Log Data?

Chat logs offer extraordinarily rich, high-quality data about everyday interpersonal interactions.

  • Mobile instant messengers (MIMs) are abundant in close relationships
    (Kemp, 2020, 2025)
  • Measurement of everyday behavior
    (Kemp, 2020, 2025)
  • Non-intrusive
  • Non-public Interactions
  • High temporal resolution
  • Retrospective data collections
    • reduced subjectivity bias
    • reduced memory effects
    • reduced reactivity
    • reduced social-desirability bias

Previous Research

Why WhatsApp?


  • most popular MIM in the world
    (Kemp, 2025)

  • 2 Billion monthly active users
    (Kemp, 2025; Montag et al., 2015)

  • available for Android and iOS

  • Unobtrusively logs interactions

  • Option for chat-log exports

  • Retrospective, highly granular communication data

How can we get WhatsApp data?

WhatsApp Chat log data can be obtained in at least 2 different ways:

1) Joining the conversation

  • researchers identify target conversations or groups

  • researchers join the conversation

  • researchers export the chat log data from the group using the WhatsApp export function

  • Advantages:

    • Reduced burden for participants
  • Disadvantages:

    • Data collection is not retrospective

    • More effort for researchers

    • Participants are either aware of being studied or not asked for consent

Joining the Conversation

How can we get WhatsApp data?

WhatsApp Chat log data can be obtained in at least 2 different ways:

2) Data Donations

  • researchers identify target conversations or groups

  • researchers ask participants to export chat logs

  • participants donate the exported chat logs to the researchers

  • Advantages:

    • Retrospective data collection

    • full transparency for participants

    • Active, opt-in consent

  • Disadvantages:

    • More effort for participants

WhatsApp Data Donations

Individual Chats vs. Complete Backup

  • Users in WhatsApp can export:

    • An individual chat with a person or a group

      • exported directly to the person’s phone or sent with a service of choice

      • unencrypted .txt file or zip file

    • A complete backup file of all their conversations, including media files

      • Google Account or iCloud necessary

      • saved as a backup file that cannot easily be inspected manually

      • Designed for data recovery, not data sharing

      • AFAIK, no tools leverage this as a source of data

How can WhatsApp Chat Logs be exported?

What kind of data do we get?


With media

  • Zip archive
  • The last 10,000 messages in the conversation as a .txt file
  • Media files from all included messages
    • images
    • videos
    • audio files (voice messages)
    • contacts
  • Sent locations (Google Maps links in .txt file)
  • phone and video calls through WhatsApp are indicated in the .txt file
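
Before parsing, it can help to check what a "with media" archive actually contains. The following is a minimal sketch using Python's standard zipfile module. It builds a small stand-in archive in memory and counts its members by file extension. All file names in the demo archive are invented for illustration; real exports name their files differently depending on platform and language settings.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_export(zip_bytes: bytes) -> Counter:
    """Count archive members by lower-cased file extension."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Skip directory entries, keep only files
        names = [n for n in archive.namelist() if not n.endswith("/")]
    return Counter(PurePosixPath(n).suffix.lower() or "(none)" for n in names)


def build_demo_archive() -> bytes:
    """Create a small fake export; every file name here is hypothetical."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("chat.txt", "placeholder chat log\n")
        archive.writestr("IMG-0001.jpg", b"")
        archive.writestr("IMG-0002.JPG", b"")
        archive.writestr("AUD-0001.opus", b"")
        archive.writestr("contact.vcf", b"")
    return buffer.getvalue()


summary = summarize_export(build_demo_archive())
print(dict(summary))
```

A summary like this gives a quick impression of how many images, audio files, and contact cards a donation would include before any of them are opened.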

Without media

  • Single .txt file or Zip archive (if contacts are included)
  • last 40,000 messages in the chat
  • No media files (images, audio, video)
  • Sent contacts are included as .vcf files
  • file names of media files are still included in the .txt file as reference
  • phone and video calls through WhatsApp are indicated in the .txt file
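
To illustrate what turning the exported .txt file into structured data involves, here is a minimal Python sketch. It is not WhatsR's implementation. It assumes a single, simplified line layout of the form "DD.MM.YY, HH:MM - Sender: Message"; real exports differ by operating system, language, and time settings, which is why a dedicated parser is needed in practice. The sample chat and all names in it are made up.

```python
import re
from datetime import datetime

# ASSUMED line layout (hypothetical): "24.03.25, 09:15 - Alice: Hello"
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{2}\.\d{2}\.\d{2}), (?P<time>\d{2}:\d{2}) - "
    r"(?:(?P<sender>[^:]+): )?(?P<text>.*)$"
)


def parse_chat(raw: str) -> list:
    """Split a raw chat log into one dict per message."""
    messages = []
    for line in raw.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            timestamp = datetime.strptime(
                f"{match['date']} {match['time']}", "%d.%m.%y %H:%M"
            )
            messages.append({
                "datetime": timestamp,
                "sender": match["sender"],  # None for lines without a sender
                "text": match["text"],
            })
        elif messages:
            # No timestamp: treat the line as a continuation of the
            # previous (multi-line) message
            messages[-1]["text"] += "\n" + line
    return messages


SAMPLE = (
    "24.03.25, 09:00 - Messages and calls are end-to-end encrypted.\n"
    "24.03.25, 09:15 - Alice: Hello\n"
    "24.03.25, 09:16 - Bob: Hi, see this list:\n"
    "first item\n"
)

parsed = parse_chat(SAMPLE)
for message in parsed:
    print(message["datetime"], message["sender"], repr(message["text"]))
```

Lines without a "Sender: " part are kept with sender set to None, which is a crude stand-in for system messages. A system message that itself contains ": " would be misread by this pattern.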

What data do we not get?


No metadata

  • WhatsApp settings
    • read receipts
    • self-deleting messages
  • Profile information
    • profile picture
    • status
    • info message
  • when was the chat exported
  • Phone notification settings

What data do we not get?


Basic Information

  • Demographics
    • age
    • gender
    • education level
    • etc.
  • Psychological constructs
    • personality
    • attachment styles

What data do we not get?


Relationship Information

  • How long have they known each other?
  • Do they cohabit?
  • Relationship type (friends, family, romantic partner?)
  • Relationship quality (conflict, support, intimacy)

What data do we not get?


Other communication channels

  • face-to-face interactions
  • phone or video calls through other apps
  • Other mobile messaging apps

Code along:
Exporting and parsing
WhatsApp Chatlog data

Anonymization

  • WhatsApp chat logs contain a lot of personally identifiable information (PII).

  • There are multiple good reasons to remove it:

    • Parsimony: Researchers should only work with data that they absolutely need

    • Ethics Boards: Getting approval from an ethics board is easier for anonymous data

    • Participation Willingness: People are more likely to share their data if it’s anonymized

    • Consent: Consent might only be necessary from the data donor, not from all participants

    • FAIR data: Data can be shared much more easily when it’s anonymized

  • However: Depending on the research question at hand, raw data might be necessary.

Anonymization

  • Essentially, there are two ways to anonymize data:

    1. Delete the parts of the data that contain PII

    2. Alter the parts of the data that contain PII

      • Aggregate

      • Pseudonymize

      • Reduce

  • Whenever feasible, researchers prefer option 2 because it retains as much information as possible.
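
The difference between deleting and altering can be sketched in a few lines of Python. The example below is illustrative only: the data are made up, the function names are hypothetical, and "reduce" is interpreted here as keeping only a coarse property of a value (its length), which is one possible reading of the term.

```python
from collections import Counter
from datetime import datetime

# Made-up example messages
messages = [
    {"datetime": datetime(2025, 3, 24, 9, 15), "sender": "Alice", "text": "Hello"},
    {"datetime": datetime(2025, 3, 24, 9, 16), "sender": "Bob", "text": "Hi Alice"},
    {"datetime": datetime(2025, 3, 25, 18, 2), "sender": "Alice", "text": "Call me"},
]


def delete_field(rows, field):
    """Option 1: drop the PII-bearing field entirely."""
    return [{k: v for k, v in row.items() if k != field} for row in rows]


def pseudonymize(rows, field, prefix="Person_"):
    """Option 2 (pseudonymize): one stable placeholder per distinct value."""
    mapping = {}
    result = []
    for row in rows:
        value = row[field]
        if value not in mapping:
            mapping[value] = f"{prefix}{len(mapping) + 1}"
        result.append({**row, field: mapping[value]})
    return result


def reduce_to_length(rows, field):
    """Option 2 (reduce): keep only the character count of a text field."""
    return [{**row, field: len(row[field])} for row in rows]


def aggregate_per_day(rows):
    """Option 2 (aggregate): message counts per sender and calendar day."""
    return Counter(
        (row["sender"], row["datetime"].date().isoformat()) for row in rows
    )


anonymized = reduce_to_length(pseudonymize(messages, "sender"), "text")
daily = aggregate_per_day(anonymized)
print(anonymized[0])
print(dict(daily))
```

Deleting the text column removes all information about it, whereas the altered version still supports questions about who wrote how much and when, without exposing names or content.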

Anonymizing Chat Logs in WhatsR

Column Name  Description                        PII  Anonymization
DateTime     Timestamp (yyyy-mm-dd hh:mm:ss)    no   none
Sender       Sender name (incl. system msgs)    yes  placeholder
Message      User message text                  yes  deleted
Flat         Simplified message                 yes  deleted
TokVec       Tokenized message (list of words)  yes  deleted
URL          URLs/domains                       yes  domains
Media        Media filenames                    yes  file ext
Location     Location URLs/indicators           yes  indicator
Emoji        Emoji glyphs                       no   none
EmoDesc      Emoji text                         no   none
Smilies      Smileys                            no   none
SysMsg       System messages                    yes  deleted
TokCount     Token count                        no   none
TimeOrd      Timestamp order                    no   none
DispOrd      Chat display order                 no   none


  • Several columns are completely unproblematic as they do not contain any PII

  • Some columns are problematic but can be anonymized

  • The columns containing the sent messages are highly problematic because they can contain any form of PII in any format.

  • While sophisticated anonymization software exists that could potentially anonymize free text, WhatsR deletes these columns instead.
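
The column-wise rules from the table can be expressed as a small policy function. The sketch below is a Python illustration of those rules and not WhatsR's actual code (WhatsR is an R package). The example row, the placeholder naming scheme, and the choice to store URLs and media names as lists are all assumptions made for the example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Columns that the table marks as "deleted"
DELETE = {"Message", "Flat", "TokVec", "SysMsg"}


def anonymize_row(row: dict, placeholders: dict) -> dict:
    """Apply the per-column anonymization rules to one parsed message."""
    out = {}
    for column, value in row.items():
        if column in DELETE:
            continue  # free text is dropped entirely
        if column == "Sender":
            # Same sender always maps to the same placeholder
            value = placeholders.setdefault(
                value, f"Person_{len(placeholders) + 1}"
            )
        elif column == "URL":
            value = [urlparse(url).netloc for url in value]  # domain only
        elif column == "Media":
            value = [PurePosixPath(name).suffix for name in value]  # extension only
        elif column == "Location":
            value = bool(value)  # indicator: was a location shared?
        out[column] = value
    return out


# Made-up example row; column names follow the table above
row = {
    "DateTime": "2025-03-24 09:15:00",
    "Sender": "Alice",
    "Message": "Look: https://example.org/a?b=1",
    "URL": ["https://example.org/a?b=1"],
    "Media": ["IMG-0001.jpg"],
    "Location": [],
    "TokCount": 2,
}

placeholders = {}
anonymized_row = anonymize_row(row, placeholders)
print(anonymized_row)
```

Keeping the placeholder mapping outside the function means the same sender receives the same placeholder across all rows of a chat.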

ChatDashboard for Data Donation Studies


  • WhatsR can be used to parse, anonymize and check consent for donated WhatsApp chat logs
  • However, we still need a way for the data to get from the participant to the researchers
  • That is: While WhatsR can act as a backend for data processing, we still need a frontend
  • A frontend should fulfill multiple requirements:
    • Easy to use for participants
    • “Easy” to set up for researchers
    • Modifiable by researchers
    • Data encryption in transit and at rest
    • Give participants feedback about their own behavior
    • Allow participants to manually remove data before donation
    • Link data donations to corresponding survey data

Researcher Setup

Steps to set up the ChatDashboard for a data donation study:

  • Local:
    • Download ChatDashboard from GitHub
    • Create your own pair of RSA keys for encryption and decryption
    • Install necessary system dependencies and R-packages (follow GitHub Readme)
    • Configure ChatDashboard Settings based on your requirements
    • Test manually
  • Online:

Modifiability

  • ChatDashboard is an R Shiny web app. It can be modified and customized without knowledge of web development frameworks like React, Angular or Vue.js

  • The processing of data is handled by WhatsR and can be “easily” modified or extended with your own code

Encryption

  • ChatDashboard uses SSL encryption for data in transit and RSA encryption for data at rest

  • Researchers generate their own RSA keys, so they are the only ones able to access the data after it’s encrypted

  • Researchers can host the ChatDashboard on their own server, enabling them to add additional layers of encryption if desired

Manual Data Removal

  • ChatDashboard has an interactive interface that allows participants to manually remove data before donation

  • Participants can remove whole columns or individual messages

  • Consent checking and anonymization are done after manual removal and are not affected by participants’ choices

  • Researchers can quantify later how much data was removed by participants
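
How such a quantification could look is sketched below in Python. The slides do not describe how ChatDashboard records removals, so this is only one simple way to express "how much was removed" as numbers. It assumes, hypothetically, that a snapshot of the rows before and after manual removal is available and that each row carries a stable identifier (called row_id here).

```python
def removal_summary(before: list, after: list) -> dict:
    """Compare two snapshots of a chat and report what was removed."""
    ids_before = {row["row_id"] for row in before}
    ids_after = {row["row_id"] for row in after}
    columns_before = set().union(*(row.keys() for row in before))
    columns_after = set().union(*(row.keys() for row in after))
    removed = ids_before - ids_after
    return {
        "rows_removed": len(removed),
        "share_rows_removed": len(removed) / len(ids_before) if ids_before else 0.0,
        "columns_removed": sorted(columns_before - columns_after),
    }


# Made-up snapshots: one message and the Emoji column were removed
before = [
    {"row_id": 1, "Sender": "Person_1", "Emoji": ""},
    {"row_id": 2, "Sender": "Person_2", "Emoji": "x"},
    {"row_id": 3, "Sender": "Person_1", "Emoji": ""},
    {"row_id": 4, "Sender": "Person_2", "Emoji": ""},
]
after = [
    {"row_id": 1, "Sender": "Person_1"},
    {"row_id": 2, "Sender": "Person_2"},
    {"row_id": 4, "Sender": "Person_2"},
]

report = removal_summary(before, after)
print(report)
```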

Linking to Survey Data

  • A participant ID can be forwarded to ChatDashboard via a URL parameter

  • The participant ID then becomes a valid username for logging in to ChatDashboard

  • The participant ID is attached to the data donation in the file name and can be used to link the chat to the corresponding survey data
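
The linking logic can be sketched in Python as follows. The URL, the parameter name "id", and the file naming scheme are hypothetical and only illustrate the idea; the actual parameter and file names used by ChatDashboard should be taken from its documentation.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def participant_id_from_url(url: str, parameter: str = "id") -> Optional[str]:
    """Read the participant ID from a URL query parameter."""
    values = parse_qs(urlparse(url).query).get(parameter)
    return values[0] if values else None


def link_donations(filenames: list, survey_by_id: dict) -> dict:
    """Match donation files to survey records via the ID prefix.

    ASSUMPTION: file names start with the participant ID followed by "_".
    """
    linked = {}
    for name in filenames:
        participant_id = name.split("_", 1)[0]
        if participant_id in survey_by_id:
            linked[participant_id] = {
                "file": name,
                "survey": survey_by_id[participant_id],
            }
    return linked


# Hypothetical survey redirect and made-up data
redirect = "https://dashboard.example.org/?id=P0042"
pid = participant_id_from_url(redirect)

survey = {"P0042": {"age": 29}, "P0043": {"age": 41}}
files = ["P0042_donation.bin", "P0099_donation.bin"]
linked = link_donations(files, survey)
print(pid, linked)
```

Donations without a matching survey record (P0099 in the example) are simply left out of the result, so they can be inspected separately.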

Code along:
Installing and adapting ChatDashboard

Questions and Answers

References

Brinberg, M., & Ram, N. (2021). Do new romantic couples use more similar language over time? Evidence from intensive longitudinal text messages. Journal of Communication, 71(3), 454–477.
Brinberg, M., Vanderbilt, R. R., Solomon, D. H., Brinberg, D., & Ram, N. (2021). Using technology to unobtrusively observe relationship development. Journal of Social and Personal Relationships, 38(12), 3429–3450.
Bursztyn, V. S., & Birnbaum, L. (2019). Thousands of small, constant rallies: A large-scale analysis of partisan WhatsApp groups. Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 484–488. https://doi.org/10.1145/3341161.3342905
Caetano, J. A., Oliveira, J. F. de, Lima, H. S., Marques-Neto, H. T., Magno, G., Meira Jr, W., & Almeida, V. A. (2018). Analyzing and characterizing political discussions in WhatsApp public groups. https://arxiv.org/abs/1804.00397
Corten, R., Boeschoten, L., Carrière, T. C., Jongerius, S., Struminskaya, B., Mulder, J., Zahedi, P., Najafabadi, S. N., & Mendrik, A. (n.d.). Assessing mobile instant messenger networks with donated data.
Freitas Melo, P. de, Vieira, C. C., Garimella, K., Melo, P. O. V. de, & Benevenuto, F. (2020). Can WhatsApp counter misinformation by limiting message forwarding? Complex Networks and Their Applications VIII: Volume 1 Proceedings of the Eighth International Conference on Complex Networks and Their Applications COMPLEX NETWORKS 2019 8, 372–384.
García-Gómez, A. (2018). Managing conflict on WhatsApp: A contrastive study of British and Spanish family disputes. Journal of Language Aggression and Conflict, 6(2), 320–343. https://doi.org/10.1075/jlac.00015.gar
Garimella, K., & Eckles, D. (2020). Images and misinformation in political groups: Evidence from WhatsApp in India. https://arxiv.org/abs/2005.09784
Garimella, K., & Tyson, G. (2018). WhatsApp, doc? A first look at WhatsApp public group data. Proceedings of the Twelfth International AAAI Conference on Web and Social Media (ICWSM 2018). https://ojs.aaai.org/index.php/ICWSM/issue/view/270
Hase, V., & Haim, M. (2024). Can we get rid of bias? Mitigating systematic error in data donation studies through survey design strategies. Computational Communication Research, 6(2), 1. https://doi.org/10.5117/CCR2024.2.2.HASE
Jensen, M., & Hussong, A. M. (2021). Text message content as a window into college student drinking: Development and initial validation of a dictionary of “alcohol-talk.” International Journal of Behavioral Development, 45(1), 3–10.
Kemp, S. (2020). Digital 2020: 3.8 billion people use social media. https://wearesocial.com/uk/blog/2020/07/digital-use-around-the-world-in-july-2020
Kemp, S. (2025). Digital 2025 global overview report. https://datareportal.com/reports/digital-2025-global-overview-report
Kohne, J., & Montag, C. (under review). Put your data where your mouth is: Investigating the viability of WhatsApp data donations for interpersonal relationship research.
Machado, C., Kira, B., Narayanan, V., Kollanyi, B., & Howard, P. (2019). A study of misinformation in WhatsApp groups with a focus on the Brazilian presidential elections. In L. Liu & W. Ryen (Eds.), Companion proceedings of the 2019 world wide web conference (pp. 1013–1019). https://doi.org/10.1145/3308560.3316738
Melo, P., Messias, J., Resende, G., Garimella, K., Almeida, J., & Benevenuto, F. (2019). WhatsApp monitor: A fact-checking system for WhatsApp. Proceedings of the 2019 International AAAI Conference on Web and Social Media, 676–677. https://ojs.aaai.org/index.php/ICWSM/article/view/3271
Montag, C., Błaszkiewicz, K., Sariyska, R., et al. (2015). Smartphone usage in the 21st century: Who is active on WhatsApp? BMC Research Notes, 8, 331. https://doi.org/10.1186/s13104-015-1280-z
Narayanan, V., Kollanyi, B., Hajela, R., Barthwal, A., Marchal, N., & Howard, P. N. (2019). News and information over Facebook and WhatsApp during the Indian election campaign. Data Memo, 2, 1–8. https://demtech.oii.ox.ac.uk/research/posts/news-and-information-over-facebook-and-whatsapp-during-the-indian-election-campaign
Resende, G., Melo, P., Sousa, H., Messias, J., Vasconcelos, M., Almeida, J., & Benevenuto, F. (2019). (Mis)information dissemination in WhatsApp: Gathering, analyzing and countermeasures. Proceedings of WWW ’19: The World Wide Web Conference, 818–828. https://doi.org/10.1145/3308558.3313688
Seufert, A., Poignée, F., Hoßfeld, T., & Seufert, M. (2022). Pandemic in the digital age: Analyzing WhatsApp communication behavior before, during, and after the COVID-19 lockdown. Humanities and Social Sciences Communications, 9(1).
Seufert, A., Poignée, F., Seufert, M., & Hoßfeld, T. (2023). Share and multiply: Modeling communication and generated traffic in private WhatsApp groups. IEEE Access, 11, 25401–25414.
Seufert, M., Hoßfeld, T., Schwind, A., Burger, V., & Tran-Gia, P. (2016). Group-based communication in WhatsApp. 2016 IFIP Networking Conference (IFIP Networking) and Workshops, 536–541.
Sprugnoli, R., Menini, S., Tonelli, S., Oncini, F., & Piras, E. (2018). Creating a WhatsApp dataset to study pre-teen cyberbullying. Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), 51–59. https://doi.org/10.18653/v1/W18-5107
Ueberwasser, S., & Stark, E. (2017). What’s up, Switzerland? A corpus-based research project in a multilingual country. Linguistik Online, 84(5).
Underwood, M. K., Rosen, L. H., More, D., Ehrenreich, S. E., & Gentsch, J. K. (2012). The BlackBerry project: Capturing the content of adolescents’ text messaging. Developmental Psychology, 48(2), 295.
Verheijen, L., & Stoop, W. (2016). Collecting Facebook posts and WhatsApp chats: Corpus compilation of private social media messages. Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings 19, 249–258.